# Do not edit this cell
# course: 3654
# a: Project 1
# d: VT
Name: Thejus Poruthikode Unnivelan PID: thejuspu
Name: Andrew Visocan PID: avisocan
We have neither given nor received unauthorized assistance on this assignment. See the course syllabus for details on the Honor Code policy. In particular, sharing lines of solution code is prohibited.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import preprocessing
from datetime import datetime
companyFile = "S&P500_Companies.csv"
companies = pd.read_csv(companyFile)
companies.columns
Our goal is to organize the companies of the S&P 500 into groups that flourished or struggled during Covid-19, based on OHLCV data (Open, High, Low, Close, Volume). companyFile contains details on the 509 companies of the S&P 500, including ticker symbol, company name, sector, and industry (individual industries make up sectors).
OHLCV = {}
for i in companies['Ticker']:
    temp = pd.read_csv("./DATA/" + i + ".csv")
    dates = temp['date']
    dateTimes = []
    for j in dates:
        dateTimes.append(datetime.strptime(j, '%Y-%m-%d'))
    temp['date'] = dateTimes
    temp.set_index('date', inplace=True)
    OHLCV[i] = temp
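As a side note, the per-row strptime loop above can be collapsed into a single vectorized call with pd.to_datetime. A minimal sketch, using a toy frame whose column names are assumed to match the real CSVs:

```python
import pandas as pd

# Toy frame standing in for one ticker's CSV (column names assumed
# to match the real files).
temp = pd.DataFrame({
    "date": ["2020-02-20", "2020-03-23"],
    "Open": [100.0, 70.0],
})

# pd.to_datetime parses the whole column at once, replacing the
# explicit strptime loop.
temp["date"] = pd.to_datetime(temp["date"], format="%Y-%m-%d")
temp.set_index("date", inplace=True)
```

The resulting index is a DatetimeIndex, which is what the later date-range slicing relies on.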
OHLCV['A'].columns
This method normalizes the data. sklearn's preprocessing.MinMaxScaler scales an ndarray of values into a desired range ((0, 1), which is the default and what we used) using the max and min values within the ndarray.
We wanted all the output series to open at the same number, so we divided by the initial opening value. That division makes values below the opening value fall between 0 and 1 and values above it large, so we took the natural log of the quotient; to avoid accidentally taking the log of zero, we added .001 to the numerator and denominator in the division step.
We have decided to name this plus, because we don't know whether this step will provide any useful insight into our problem, or whether this will help us create a better solution. So this has been sprinkled on top.
# Pre-req: data = OHLCV[ticker].drop('date', axis=1).drop('Volume', axis=1)
# and/or
# Pre-req: data = OHLCV[ticker].Volume
def normalize(data):
    normaliser = preprocessing.MinMaxScaler()
    temp = data.values
    temp = normaliser.fit_transform(temp)
    # temp = (temp + .001) / (temp[0][0] + .001)
    # temp = np.log(temp)
    return temp

def normalizePlus(data):
    normaliser = preprocessing.MinMaxScaler()
    temp = data.values
    temp = normaliser.fit_transform(temp)
    temp = (temp + .001) / (temp[0][0] + .001)
    temp = np.log(temp)
    return temp
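To see what the plus step actually computes, here is a minimal numeric sketch on a toy array, using a plain NumPy equivalent of MinMaxScaler's per-column formula (the toy values are made up for illustration):

```python
import numpy as np

# Toy OHLC-like array: two columns with easy-to-check extremes.
values = np.array([[1.0, 2.0],
                   [2.0, 3.0],
                   [3.0, 4.0]])

# Per-column min-max scaling, the same arithmetic MinMaxScaler performs.
scaled = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

# The "plus" step: divide by the first (scaled) Open and take the natural
# log, with .001 added to both sides to dodge log(0).
plus = np.log((scaled + .001) / (scaled[0][0] + .001))
```

Because the first scaled Open here is 0, the offset .001 dominates the denominator, which illustrates the sensitivity to the initial data point discussed later in the analysis.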
def normOHLCV(data):
    temp = normalize(data.drop('Volume', axis=1))
    temp = pd.DataFrame(data=temp, columns=['Open', 'High', 'Low', 'Close'], index=data.index)
    temp['Volume'] = data['Volume']
    return temp

def normOHLCVplus(data):
    temp = normalizePlus(data.drop('Volume', axis=1))
    temp = pd.DataFrame(data=temp, columns=['Open', 'High', 'Low', 'Close'], index=data.index)
    temp['Volume'] = data['Volume']
    return temp
normPCP is a dictionary, keyed by ticker symbol, whose values are tuples of three OHLCV frames, one per time frame, with OHLC and Volume normalized within each frame:
Pre-Covid: [start, preCrashHigh]
Crash: [preCrashHigh, postCrashLow]
Post-Covid: [postCrashLow, end]
crash = datetime(2020, 2, 20)
start = datetime(2019, 8, 20)
end = datetime(2020, 8, 20)
preCrashHigh = datetime(2020, 2, 20)
postCrashLow = datetime(2020, 3, 23)
normPCP = {}
for i in companies['Ticker']:
    data = OHLCV[i]
    pre = data.loc[:preCrashHigh, :]
    crash = data.loc[preCrashHigh:postCrashLow, :]
    post = data.loc[postCrashLow:, :]
    if pre.shape[0] != 0:
        normPCP[i] = (normOHLCV(pre), normOHLCV(crash), normOHLCV(post))
normPCPplus = {}
for i in companies['Ticker']:
    data = OHLCV[i]
    pre = data.loc[:preCrashHigh, :]
    crash = data.loc[preCrashHigh:postCrashLow, :]
    post = data.loc[postCrashLow:, :]
    if pre.shape[0] != 0:
        normPCPplus[i] = (normOHLCVplus(pre), normOHLCVplus(crash), normOHLCVplus(post))
print('Removed ' + str(len(OHLCV.keys()) - len(normPCP.keys())) + ' companies for lacking data over the full time frame')
preNorm = pd.DataFrame()
crashNorm = pd.DataFrame()
postNorm = pd.DataFrame()
for i in normPCP.keys():
    setPCP = normPCP[i]
    preNorm[i] = setPCP[0].Open
    crashNorm[i] = setPCP[1].Open
    postNorm[i] = setPCP[2].Open
preNorm['date'] = setPCP[0].index
preNorm.set_index('date', inplace=True)
crashNorm['date'] = setPCP[1].index
crashNorm.set_index('date', inplace=True)
postNorm['date'] = setPCP[2].index
postNorm.set_index('date', inplace=True)
preNormPlus = pd.DataFrame()
crashNormPlus = pd.DataFrame()
postNormPlus = pd.DataFrame()
for i in normPCPplus.keys():
    setPCP = normPCPplus[i]
    preNormPlus[i] = setPCP[0].Open
    crashNormPlus[i] = setPCP[1].Open
    postNormPlus[i] = setPCP[2].Open
preNormPlus['date'] = setPCP[0].index
preNormPlus.set_index('date', inplace=True)
crashNormPlus['date'] = setPCP[1].index
crashNormPlus.set_index('date', inplace=True)
postNormPlus['date'] = setPCP[2].index
postNormPlus.set_index('date', inplace=True)
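The column-by-column assembly above can also be done in one pd.concat call, since a dict of Series with a shared index concatenates into exactly this kind of wide ticker-by-date frame. A sketch on toy stand-in data (the real series would come from normPCP[ticker][0].Open):

```python
import pandas as pd

# Toy stand-ins for normPCP[ticker][0].Open series (shared DatetimeIndex).
idx = pd.to_datetime(["2020-02-20", "2020-02-21"])
opens = {
    "AAA": pd.Series([0.1, 0.2], index=idx),
    "BBB": pd.Series([0.3, 0.4], index=idx),
}

# One concat builds the whole wide frame; dict keys become column names
# and the shared index carries over, so no manual set_index is needed.
preNormAlt = pd.concat(opens, axis=1)
preNormAlt.index.name = "date"
```

This avoids relying on setPCP leaking out of the loop to supply the date index.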
for i in set(companies['Sector']):
    print('\t\t\t', '\033[1m' + i + '\033[0m')
    sector = companies[companies['Sector'] == i]
    for j in set(sector['Industry']):
        print(j)
        industry = sector[sector['Industry'] == j]
        fig, ((preAx, preAxPlus), (crashAx, crashAxPlus), (postAx, postAxPlus)) = plt.subplots(nrows=3, ncols=2, figsize=(15, 15))
        tickers = []
        for k in industry.index:
            ticker = industry.loc[k, 'Ticker']
            if ticker in normPCP.keys():
                print(ticker, ':\t', industry.loc[k, 'Company'])
                tickers.append(ticker)
        # fig.suptitle('Industry ' + j)
        preAx.set_title('Pre-Covid')
        crashAx.set_title('Covid Crash')
        postAx.set_title('Post-Covid')
        preAxPlus.set_title('Pre-Covid Plus')
        crashAxPlus.set_title('Covid Crash Plus')
        postAxPlus.set_title('Post-Covid Plus')
        # fig.subplots_adjust(hspace = 0.3)
        # use a fresh loop variable so the outer sector variable i is not clobbered
        for t in tickers:
            preNorm[t].plot(ax=preAx)
            crashNorm[t].plot(ax=crashAx)
            postNorm[t].plot(ax=postAx)
            preNormPlus[t].plot(ax=preAxPlus)
            crashNormPlus[t].plot(ax=crashAxPlus)
            postNormPlus[t].plot(ax=postAxPlus)
        plt.show()
        print()
        print()
These graphs have given us a deeper understanding of how the market works and how we want to analyze it for our problem. We analyzed the graphs by industry because industries are a built-in clustering that gives insight into markets with high competition and markets that respond simultaneously to external factors. There are two key takeaways from these graphs:
1. The peak before the crash and the trough after the crash differ for each ticker. Because we used the S&P 500's peak and trough dates, we couldn't see that until we created these graphs. For the Plus version of the normalization this makes a big difference, because the normalized series is divided by its initial data point, which can be exorbitantly low or high.
2. The Plus version doesn't seem to glean much insight beyond making market certainty and uncertainty very apparent. However, once 1. is addressed, the Plus version may give improved insight into market conditions. Another solution would be to divide by every single data point in the time series, which effectively compares every data point against every other data point; that could be beneficial information-wise, but computationally costly.
At the end of it all, normalizing the market opens alone doesn't seem to provide enough information for clean insights. The plain normalization can't show how some tickers outperform competitors week-to-week, day-to-day, or month-to-month, because all the data is squeezed between 0 and 1. Normalization Plus, on the other hand, promises a lot and keeps asking for more in return. Feels a lot like corruption.
A good idea worth investing in is analyzing the day-to-day difference in opening values, or the daily difference between highs and lows, made positive or negative depending on how the opening and closing values relate.
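That last idea can be sketched in a few lines of pandas. A minimal example on a toy OHLC frame (the column names match our OHLCV frames; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy OHLC frame; real data would come from OHLCV[ticker].
data = pd.DataFrame({
    "Open":  [100.0, 102.0, 101.0],
    "Close": [101.0, 100.0, 103.0],
    "High":  [103.0, 104.0, 104.0],
    "Low":   [ 99.0,  99.0, 100.0],
})

# Day-to-day change in opening value (first row has no predecessor, so NaN).
openDiff = data["Open"].diff()

# Daily high-low range, signed positive on up days (Close above Open)
# and negative on down days.
signedRange = (data["High"] - data["Low"]) * np.sign(data["Close"] - data["Open"])
```

Unlike the min-max normalizations, these differences are not squeezed into [0, 1], so day-to-day outperformance between competitors stays visible.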
print(len(set(companies['Sector'])), 'Sectors Total')
for i in set(companies['Sector']):
    sector = companies[companies['Sector'] == i]
    print(len(set(sector['Industry'])), '\t', '\033[1m' + i + '\033[0m')
    for j in set(sector['Industry']):
        print(j)
    print()
for i in set(companies['Sector']):
    print('\t\t\t', '\033[1m' + i + '\033[0m')
    sector = companies[companies['Sector'] == i]
    for j in set(sector['Industry']):
        print(j)
        industry = sector[sector['Industry'] == j]
        for k in industry.index:
            print(industry.loc[k, 'Ticker'], ':\t', industry.loc[k, 'Company'])
        print()
    print()
The first thing we tried was to find patterns by looking at candlestick graphs, but that ended up taking too much time and we couldn't overlay multiple candlestick graphs on top of each other, so we decided to use line graphs for this analysis instead.
If we want to view any single graph, the cell two cells down creates an interactive candlestick graph. This has proved useful in diagnosing problems, like the one we found with CARR, where the ticker only starts trading after the crash, so we end up lacking data points.
import plotly.graph_objs as go  # pip install plotly in Anaconda Prompt
def createCandleStick(data):
    plot = [go.Candlestick(x=data.index,
                           open=data.Open,
                           high=data.High,
                           low=data.Low,
                           close=data.Close)]
    figSignal = go.Figure(data=plot)
    figSignal.show()
ticker = 'CARR'
createCandleStick(OHLCV[ticker])
Andrew
Thejus
Part 2
The biggest change was moving the data collection, trimming, and saving into a separate .ipynb file. This allows all the analysis to be run from top to bottom very easily, without triggering the Alpha Vantage downloads, which are very time-consuming.
Since normalization was done with all the OHLC data, we wanted to use the candlestick graph to display the findings, but candlestick graphs are very hard to overlay on other graphs, and the plotly library is excruciatingly slow. The candlestick graph has still proved useful for inspection.
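The cache-once pattern behind that split (fetch from Alpha Vantage only when no local CSV exists, otherwise read from ./DATA) can be sketched as below. The fetch parameter is a hypothetical stand-in for whatever download call the collection notebook makes; the helper name loadTicker is ours, not part of any library:

```python
import os
import pandas as pd

def loadTicker(ticker, fetch, dataDir="./DATA"):
    """Return cached OHLCV for ticker, calling the (slow) fetch
    function only when no local CSV exists yet.

    fetch is a hypothetical stand-in for the Alpha Vantage download
    call; it must accept a ticker and return a DataFrame.
    """
    path = os.path.join(dataDir, ticker + ".csv")
    if os.path.exists(path):
        # Cache hit: no API call, so reruns of the analysis are fast.
        return pd.read_csv(path)
    data = fetch(ticker)          # hits the rate-limited API
    os.makedirs(dataDir, exist_ok=True)
    data.to_csv(path, index=False)
    return data
```

With this shape, the analysis notebook only ever sees local CSVs, and the download notebook is the only place the API is touched.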